Identifying and Filtering Near-Duplicate Documents

نویسنده

  • Andrei Z. Broder
چکیده

The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch” for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a ”sample” of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest. The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Identification of Duplicate News Stories in Web Pages

Identifying near duplicate documents is a challenge often faced in the field of information discovery. Unfortunately many algorithms that find near duplicate pairs of plain text documents perform poorly when used on web pages, where metadata and other extraneous information make that process much more difficult. If the content of the page (e.g., the body of a news article) can be extracted from...

متن کامل

Duplicate Web Pages Detection with the Support of 2d Table Approach

Duplicate and near duplicate web pages are stopping the process of search engine. As a consequence of duplicate and near duplicates, the common issue for the search engines is raising the indexed storage pages. This high storage memory will slow down the process which automatically increases the serving cost. Finally, the duplication will be raised while gathering the required data from the var...

متن کامل

Auto-Grouping Emails For Faster E-Discovery

In this paper, we examine the application of various grouping techniques to help improve the efficiency and reduce the costs involved in an electronic discovery process. Specifically, we create coherent groups of email documents which characterize either a syntactic theme, a semantic theme or an email thread. All such grouped documents can be reviewed together leading to a faster and more consi...

متن کامل

A Near-duplicate Detection Algorithm to Facilitate Document Clustering

Web Ming faces huge problems due to Duplicate and Near Duplicate Web pages. Detecting Near Duplicates is very difficult in large collection of data like ”internet”. The presence of these web pages plays an important role in the performance degradation while integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting t...

متن کامل

New Issues in Near-duplicate Detection

Near-duplicate detection is the task of identifying documents with almost identical content. The respective algorithms are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, or social collaboration and interaction in the World Wide Web. Our paper presents both an i...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000